BMC Medical Informatics and Decision Making
○ Springer Science and Business Media LLC
Preprints posted in the last 90 days, ranked by how well they match BMC Medical Informatics and Decision Making's content profile, based on 39 papers previously published here. The average preprint has a 0.11% match score for this journal, so anything above that is already an above-average fit.
Wei, X.; Xao, X.; Hou, J.; Wang, Q.
Background & Aims: Accurate assessment of clinical malnutrition using anthropometric and functional indicators could improve the care of elderly trauma patients in intensive care units (ICUs). This study aimed to develop an AI-driven malnutrition assessment toolbox based on a minimal set of clinically feasible indicators. Methods: Multiple machine learning models, including logistic regression, support vector machines, k-nearest neighbors, decision trees, random forests, XGBoost, and neural-network-based ensemble models, were developed using different indicator configurations from a clinically collected patient dataset. Models were trained using baseline and longitudinal measurements to predict malnutrition risk. SHAP analysis was used to interpret the importance of selected indicators. Results: Baseline (Day 1) data alone did not provide a reliable prediction, whereas longitudinal measurements substantially improved performance. Models based on a minimal indicator set, including bilateral mid-upper arm circumference, calf circumference, and key static variables, outperformed models using the full indicator set. Tree-based methods consistently outperformed linear and distance-based models, with the three-time-point XGBoost achieving the best individual performance. Neural-network-based ensemble models further improved predictive stability. The best overall performance was achieved by the ensemble model using the minimal indicator set from Day 1 and Day 3. SHAP analysis confirmed the importance of the selected indicators. Conclusions: This AI-driven toolbox provides an efficient and clinically feasible approach for early malnutrition assessment in elderly trauma patients in the ICU. Its strong performance with a minimal indicator set supports its potential for integration into clinical workflows and future digital twin systems for intelligent nutritional management.
Nayyar, C.; Xu, H. H.; Bates, A. T.; Conati, C.; Hilbers, D.; Avery, J.; Raman, S.; Fayaz-Bakhsh, A.; Nunez, J.-J.
Background: Artificial intelligence (AI) has rapidly garnered interest in healthcare, with research showing promise to improve quality, efficiency, and outcomes. Cancer care's multidisciplinary nature and high coordination demands are well positioned to benefit from AI. While attitudes in the uptake of evidence and toward the implementation of AI in medicine have been explored generally, literature remains scarce with specific regard to AI in cancer care. This study sought to understand the perspectives of both patients and professionals, which are essential for guiding responsible, effective implementation of evidence-based (EB) AI in cancer care. Methods: We conducted a workshop at the 2024 British Columbia (BC) Cancer Summit (Vancouver, Canada). Discussions addressed three guiding questions: concerns, benefits, and priorities for AI in cancer care. Responses from 48 workshop participants (patients and families, AI/computer science/cancer researchers, clinicians and allied health professionals, information technology professionals, healthcare administrators) underwent structured conceptualization by concept mapping, leveraging multidimensional scaling and hierarchical cluster and subcluster analysis to produce visual and quantitative maps of stakeholder priorities. Results: A total of 265 statements on perceived benefits, concerns, and priorities related to the implementation of AI in cancer care were generated from the workshop and underwent concept mapping. Two clusters were identified: Cluster 1 focused on "Challenges and Safeguards for AI Implementation," and Cluster 2 focused on "Clinical Benefits and Efficiency Gains." Subcluster analysis distinguished 8 thematic subclusters (4 per cluster). Both mean importance (P < .001) and feasibility (P < .001) ratings were significantly higher for Cluster 2. No differences were found between ratings by clinical and nonclinical professionals.
Further go-zone analysis classified statements according to their relative superiority/inferiority in importance and feasibility compared to the overall average. Conclusions: Stakeholder ratings were higher for statements describing clinical benefits and efficiency gains than for those describing challenges and safeguards for AI implementation in cancer care. Concept mapping analysis distinguished between workflow-aligned AI applications, perceived as ready for implementation, and system-level governance requirements requiring longer-term investment. Present findings provide a structured, stakeholder-informed framework for prioritizing and sequencing AI implementation efforts in cancer care, constituting a practical blueprint to catalyze meaningful progress.
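The go-zone classification described above can be sketched as a quadrant assignment: each statement is compared with the overall mean importance and the overall mean feasibility. The sketch below is illustrative only. The function name `go_zone`, the quadrant labels, and the ratings are hypothetical and are not taken from the study.

```python
from statistics import mean


def go_zone(statements):
    """Assign each statement to a quadrant relative to the overall mean ratings."""
    imp_mean = mean(s["importance"] for s in statements)
    fea_mean = mean(s["feasibility"] for s in statements)
    zones = {}
    for s in statements:
        high_imp = s["importance"] >= imp_mean
        high_fea = s["feasibility"] >= fea_mean
        if high_imp and high_fea:
            zone = "go-zone"  # above average on both ratings
        elif high_imp:
            zone = "important, less feasible"
        elif high_fea:
            zone = "feasible, less important"
        else:
            zone = "low priority"
        zones[s["id"]] = zone
    return zones


# Hypothetical ratings on a 1-5 scale (not data from the workshop).
example = [
    {"id": "S1", "importance": 4.6, "feasibility": 4.2},
    {"id": "S2", "importance": 4.4, "feasibility": 2.1},
    {"id": "S3", "importance": 2.0, "feasibility": 4.0},
    {"id": "S4", "importance": 1.8, "feasibility": 1.9},
]
zones = go_zone(example)
for statement_id, zone in zones.items():
    print(statement_id, zone)
```

Statements landing in the upper-right quadrant (here, `S1`) are the ones a concept-mapping study would typically flag as both important and feasible.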
Ahammed, F.
Healthcare fraud is a worsening problem with far-reaching consequences: it burdens the financial stability of the health industry and threatens the quality of medical care. It arises from vulnerabilities in the current healthcare framework that fraudsters exploit. Although many models have been developed to detect fraudulent patterns in insurance claims, their accuracy frequently suffers from class imbalance in the Medicare dataset and from irrelevant features. This study aims to improve detection performance and accuracy by combining a deep learning model with data sampling and feature selection techniques. Different combinations are compared to determine how effectively they improve the accuracy of the fraud detection model. The results show that combining data sampling with feature selection improves accuracy and performance. Using both Chi-square feature selection and the Synthetic Minority Over-sampling Technique (SMOTE), the model reached an accuracy of 95.4% with negligible evidence of overfitting. Ultimately, the findings underscore the value of combined techniques over the baseline deep learning model alone for detecting Medicare insurance fraud.
Bressman, E.; Auerbach, A.; Keniston, A.; Jens, C.; Ranji, S.
Introduction: The use of artificial intelligence (AI) by clinicians has increased rapidly in recent years, with large language models (LLMs) emerging as tools that can equal clinician diagnostic performance in simulated settings. However, limited data exist regarding physicians' use of LLMs in real-world clinical practice. This study aimed to evaluate the frequency of LLM use among practicing hospitalists, identify which LLMs are most commonly utilized, and assess hospitalists' perceptions of the benefits and limitations of LLM use in clinical care. Methods: We conducted a cross-sectional survey study of academic hospital medicine faculty across 8 institutions within the Hospital Medicine Reengineering Network (HOMERuN), a collaborative research consortium. Eligible participants included hospitalists practicing within participating HOMERuN sites during the study period. The survey assessed the frequency of LLM use, types of LLMs used, clinical applications, and physician perceptions regarding usefulness, efficiency, and concerns associated with LLM adoption. Results: 170 respondents (67.1%) reported ever using an LLM in clinical practice. Among LLM users, OpenEvidence was the most used tool (88.9%), followed by ChatGPT (58.5%), Google Gemini (26.9%), and Microsoft Copilot (20.5%). Only a minority of hospitalists reported using LLMs daily while seeing patients. The most common use cases of LLMs were answering diagnostic (77.1%) and management (77.6%) questions. A majority also reported using LLMs to identify or summarize primary literature (60.0%). Lack of trust in outputs (49.8%), uncertainty around institutional policies (48.6%), and lack of access to secure applications (43.1%) were cited as the most frequent barriers to using LLMs in practice. Discussion: The use of LLMs in clinical practice is already widespread, though regular or daily use is not yet typical.
Concerns regarding reliability, patient privacy, and safe integration into clinical workflows remain significant barriers to broader adoption. The responsible implementation of LLMs in hospital medicine will require addressing these barriers.
Sozol, S. S.; Dev Nath, B. C.; Fahim, F. M. S.; Suzana, N. N.; Mirza, J. F.; Ahmmed, S.; Zohra, F.-T.; Zafr, A. H. A.; Uddin, M. N.; Mondal, M. R. H.; Hoque, A. S. M. L.
Machine learning (ML) is being considered to help diagnose cardiovascular diseases (CVD). Still, challenges such as inconsistent and limited datasets, limited infrastructure, and global inequalities create a need for a reliable and practicable ML solution. This paper presents an ML-driven framework for predicting CVD risk scores and classifying status. Several data preprocessing techniques, including multiple imputation by chained equations (MICE) and outlier removal, are applied. In addition, hyperparameter tuning is performed with the GridSearchCV tuning technique. Moreover, a consensus-driven five-feature selection method is applied to identify optimal predictors. The dataset used in this study contains healthcare records related to future CVD risk scores, comprising 1,529 patient records with 22 features. The optimized stacked ensemble model is applied to the dataset and achieves a cross-validated coefficient of determination of 98.13% for CVD risk score regression. Comparative evaluation with other ML models confirmed improved accuracy, efficiency, and interpretability. The explainable AI technique SHAP is applied to interpret predictions and highlight key risk factors. Moreover, a deployment-ready web platform with multi-role access has been developed that demonstrates clinical applicability. The proposed framework offers a reliable and interpretable tool for early detection of CVD and personalized risk assessment. In the future, this work can be extended to integrate longitudinal data, medical imaging, and deep learning to improve generalizability and strengthen real-world impact.
Nakagawa, S.; Yamamoto, A.
Cross-national alignment of branded food databases is essential for international nutritional epidemiology but lacks standardized methods. Existing approaches (food ontologies, domain-specific fine-tuned language models, and manual expert mapping) either require substantial infrastructure or do not scale to thousands of items. We propose an unsupervised evaluation framework for large language model (LLM)-based food database alignment that requires no ground-truth labels. Using the Japan Branded Food Database (JBFD; 9,519 items, 71 mid-level categories) and USDA FoodData Central (448 categories) as a case study, we introduce two complementary metrics: weighted centroid distance (nutritional proximity between matched category pairs) and dominant category share (structural consistency of category-level assignments). We then conducted a systematic ablation study across eight input conditions (A-H), varying combinations of product name, nutrient profile, and semantic category label. Results showed that nutrient-only inputs yielded poor structural consistency despite low centroid distances, while semantic category labels achieved the highest dominant category share (89.3%) but introduced circularity due to their LLM-derived origin. Among circularity-free conditions, product name combined with minimal nutrient information (energy, protein, salt; condition E) achieved the best balance of centroid distance (0.471) and dominant category share (65.8%). Model comparison across Claude Haiku, Sonnet, and Opus confirmed that NO_MATCH rates were consistent across model sizes (12-14%), suggesting that prompt design contributes more to alignment quality than model scale. These findings provide practical guidance for input design in LLM-based food database alignment without ground-truth annotation.
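The abstract does not give a formula for dominant category share, so the following is only one plausible reading, stated as an assumption: for each source category, take the target category that receives the most items, then report the item-weighted fraction of items that landed in those dominant targets. The function name and the category labels are hypothetical.

```python
from collections import Counter, defaultdict


def dominant_category_share(assignments):
    """Item-weighted share of items assigned to their source category's
    most frequent target category (assumed definition, not the paper's)."""
    by_source = defaultdict(Counter)
    for source, target in assignments:
        by_source[source][target] += 1
    total = sum(sum(counts.values()) for counts in by_source.values())
    if total == 0:
        return 0.0
    dominant = sum(max(counts.values()) for counts in by_source.values())
    return dominant / total


# Hypothetical (source category, matched target category) pairs.
pairs = (
    [("snacks", "Chips")] * 3
    + [("snacks", "Candy")]
    + [("drinks", "Soda")] * 2
)
share = dominant_category_share(pairs)
print(f"dominant category share: {share:.3f}")
```

Under this reading, a share near 1 means each source category maps almost entirely onto a single target category, which is the kind of structural consistency the metric is described as capturing.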
Lukhele, N.; Mostafa, F.
Objective: To develop and evaluate a novel machine learning (ML) framework tailored to a clinical diabetes dataset and to assess whether demographic stratification enhances model performance and interpretability for multiclass diabetes classification. Methods: A clinical dataset of 264 patient records was used to classify individuals into non-diabetic, prediabetic and diabetic categories. Several supervised learning models were trained using an 80:20 train-test split and optimized using RandomizedSearchCV with 10-fold cross-validation. Model performance was evaluated using accuracy, precision, recall and the F1-score. Area under the receiver operating characteristic curve (AUC) was calculated for the best generalizing model. A structured ML framework was developed for this dataset, incorporating preprocessing, model optimization, and stratified analysis by age (<35 vs ≥35 years) and gender. SHAP was applied for model interpretability. Results: Ensemble methods demonstrated superior performance in comparison to linear or single-tree approaches, with Gradient Boosting showing the most stable generalization with a test accuracy of 0.981 and stable cross-validation accuracy of 0.972. AUC-ROC analysis using Gradient Boosting yielded good discriminative ability across the three diabetes classes: 0.991 (non-diabetic), 0.986 (prediabetic) and 0.972 (diabetic). Stratified analysis showed improved reliability in individuals aged ≥35 years (accuracy = 0.94, F1-score = 0.92), while performance in younger individuals was unstable due to small sample size. SHAP analysis identified HbA1c, BMI, and age as dominant predictors. Conclusion: This study presents an ML framework integrating age-stratified modelling with explainable ML frameworks to improve interpretability.
The findings offer clinically relevant results that can support clinical decision-making systems, individualized risk assessment, and potential applications for targeted intervention in diabetes progression.
Naderalvojoud, B.; Sutjiadi, B. J.; Koul, A.; Curtin, C.; Gevaert, O.; Hernandez-Boussard, T.
Background: Machine learning (ML) models are increasingly used to predict adverse outcomes after surgery. However, most rely on static patient characteristics (e.g., age, comorbidities) and overlook clinician-controlled treatment decisions that can be actively modified at the point of care. Discharge opioid prescribing is a key modifiable, clinician-controlled decision, yet optimizing prescribing choices across multiple adverse outcomes remains underexplored in predictive modeling. This study addresses that gap by introducing a novel ML framework that explicitly separates fixed patient risk factors from modifiable prescribing options to support personalized, risk-informed opioid prescribing decisions. Methods: We developed the Hierarchical Clinical Fusion Transformer (HCF-Transformer), an ML model designed to estimate patient-specific risks across four postoperative outcomes: prolonged opioid use (POU), chronic pain (CP), 30-day readmission, and opioid-associated outcomes (OAO). The model constructs patient risk profiles from fixed, non-modifiable baseline factors, followed by a transformer layer. Clinician-controllable discharge opioid regimens are modeled as alternative intervention candidates and fused with the fixed risk representation through a clinical fusion mechanism, enabling assessment and ranking based on predicted risks. A Total Relative Risk (TRR) metric, calibrated to each outcome prediction threshold, guides the recommendation process. We evaluated the model in diabetic surgical patients, a common high-risk population. Results: The study included 157,853 unique diabetic surgical patients, with outcome prevalences ranging from 47.2% (POU) to 1.8% (OAO). The HCF-Transformer achieved the highest AUROCs, 0.798 for POU, 0.712 for 30-day readmission, 0.808 for CP, and 0.922 for OAO, outperforming Random Forest, FT-Transformer, and ResNet-based models.
Compared to these baselines, HCF-Transformer generated more stable and discriminative risk estimates and demonstrated significant variation in TRR scores across discharge opioid options (ANOVA p < .01, eta-squared > .01). This enabled consistent identification of lower-risk regimens tailored to patient-specific profiles. Conclusions: The HCF-Transformer introduces a novel hierarchical fusion approach to optimize opioid prescribing by integrating static patient risk profiles with modifiable discharge options. Using transformer-based modeling and a quantifiable TRR metric, the model delivers personalized, risk-aware recommendations. This approach enables data-driven opioid prescribing tailored to individual risk and has the potential to improve postoperative outcomes in high-risk populations. Our findings demonstrate that integrating modifiable factors with structured risk profiles through a transformer-based fusion architecture can enhance decision-support systems, paving the way for more actionable and personalized AI in healthcare.
Baroud, S.
Migraine detection and sentiment analysis in healthcare have become increasingly important, particularly with the rise of social media platforms like Twitter, where users often share their personal health experiences. This study presents MASHA (Multi-Agent System for Healthcare Sentiment Analysis), an artificial intelligence (AI)-driven framework that integrates multiple machine learning (ML) models for sentiment analysis of Arabic tweets related to migraines. The system leverages a multi-agent architecture to handle tasks such as data acquisition, pre-processing, model training and real-time decision-making. Key ML models, including Support Vector Machines (SVM), Naive Bayes (NB) and Logistic Regression (LR), are integrated using ensemble techniques, leading to improved classification performance. Experiments conducted on a dataset of Arabic tweets demonstrate that MASHA outperforms traditional methods, achieving an accuracy of 90.0% and an F1-score of 89.46%. Moreover, the system's scalability and flexibility make it suitable for real-time public health monitoring, offering valuable insights into patient experiences and public sentiment regarding healthcare services. MASHA's adaptability suggests its potential application for analysing other healthcare-related conditions, reinforcing the system's scalability and broader relevance. Future work will focus on incorporating deep learning (DL) models and expanding the dataset with content from additional social media platforms.
Van, T. A.
Background: Type 2 diabetes mellitus (T2DM) is a leading global public health challenge. Machine learning (ML) combined with Explainable AI (XAI) is increasingly applied to T2DM risk prediction, but the field lacks a quantitative overview of methodological trends and integration gaps. Methods: We present a structured synthesis and critical analysis of the XAI literature on T2DM risk prediction, combining (i) quantitative bibliometric analysis of a two-database corpus (N = 2,048 documents from Scopus and PubMed/MEDLINE, deduplicated via a transparent three-tier pipeline) and (ii) an in-depth selective review of 15 highly cited papers. Reporting follows PRISMA 2020, adapted for metadata-based synthesis; analyses include keyword frequency, rule-based thematic clustering, and publication trend analysis. Results: The field grew rapidly, from 36 documents (2020) to 866 (2025). SHAP and LIME dominate XAI methods; XGBoost and Random Forest dominate ML models. Critically, KG/GNN terms appeared in only 17 documents (~0.83%) compared with 906 for XAI methods, a 53.3:1 disparity. This gap is consistent across both databases, which share 33.2% of their records, ruling out a single-database artifact. The selective review confirmed that none of the 15 highly cited papers combined all three components, ML, XAI, and KG, in T2DM risk prediction. Conclusions: The XAI for T2DM risk prediction field exhibits a clinical interpretability gap: statistical explanations are rarely linked to structured clinical pathways. We propose a three-layer conceptual framework (Predictive → Explainability → Knowledge) that integrates KG as a supplementary semantic layer, with potential applications in clinical decision support and population-level screening. The framework does not perform true causal inference but structures explanations around established pathophysiological knowledge.
This study contributes a transferable methodology and a quantified research gap to guide future work integrating ML, XAI, and structured medical knowledge.
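The headline figures in this abstract follow directly from the reported counts, and the arithmetic can be checked in a few lines. The variable names below are illustrative; the counts are those stated in the abstract.

```python
# Counts as reported in the abstract.
kg_docs = 17      # documents mentioning KG/GNN terms
xai_docs = 906    # documents mentioning XAI methods
corpus = 2048     # deduplicated corpus size
docs_2020, docs_2025 = 36, 866

ratio = xai_docs / kg_docs            # XAI-to-KG disparity
kg_share = 100 * kg_docs / corpus     # KG/GNN share of the corpus, in percent
growth = docs_2025 / docs_2020        # fold increase from 2020 to 2025

print(f"XAI:KG ratio  = {ratio:.1f}:1")
print(f"KG/GNN share  = {kg_share:.2f}%")
print(f"2020-2025 growth = {growth:.1f}x")
```

The first two values reproduce the 53.3:1 disparity and the ~0.83% share; the growth factor is derived here and is not stated in the abstract.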
Gatto, J.; Yang, J.; Seegmiller, P.; Rahat, R.; Burdick, T.; Preum, S. M.
Patient portal messaging has become a primary channel for asynchronous clinical communication; it spans a wide range of content, from symptom reports and medication concerns to administrative requests. Despite this volume and diversity, there is no formal representation for what a portal message contains: no vocabulary for the clinical and administrative events it describes, or for the attributes of those events that the patient has actually disclosed. Without such a representation, it is difficult to systematically analyze portal communication, assess message completeness, or build downstream tools that depend on structured input, such as automated triage, response drafting, and follow-up question generation. A clinical event schema, grounded in real portal messages and reviewed by clinicians, would provide this missing foundation. We introduce a clinical event ontology for patient portal messages, containing 8 event types and 70 roles that span clinical content (symptoms, medications, diagnostic tests, treatment responses, patient history) and administrative content (medical needs, logistics, social factors). The ontology was developed iteratively in collaboration with clinical experts and refined through human evaluation. As a downstream application, we use the ontology to characterize the event types and roles most frequently sought in clinician follow-up questions, which provides insight into what clinicians ask about when reading portal messages.
Ye, L.; Lyu, B.; Yang, Q.; Mou, X.; Nawawonganun, R.; Laohasiriwong, W.
Background: Multi-drug-resistant bacterial (MDRB) infections in intensive care units (ICUs) substantially elevate patient mortality, prolong hospital stays, and impose heavy healthcare cost burdens. Existing predictive models for ICU-acquired MDRB infection predominantly focus on static admission-risk assessment, lacking the capacity to leverage longitudinal treatment data for dynamic risk re-stratification during the ICU stay. Meanwhile, most models suffer from poor clinical interpretability, overreliance on hard-to-collect biomarkers, or absence of deployable clinical tools, limiting real-world translation. Therefore, there is an urgent need to develop a parsimonious, interpretable tool based on routine cumulative data to guide timely intervention. This study aimed to develop an interpretable model with a web calculator to improve clinical applicability. Methods: We conducted a retrospective analysis of ICU inpatients at the First Affiliated Hospital of Dali University between January 1, 2023, and January 1, 2026. Using the createDataPartition function in R (random seed = 42), the dataset was stratified and divided into a training group and a validation group in a 7:3 ratio. Feature selection was performed using the Boruta algorithm to validate variable rationality. A multivariable logistic regression model was constructed and visualized as a nomogram, and its performance was compared with six machine learning algorithms (Random Forest, XGBoost, Neural Network, etc.). Model validation was conducted using receiver operating characteristic (ROC) curves, Decision Curve Analysis (DCA), and SHAP value interpretation. Finally, an online R Shiny calculator was developed based on the final model. Results: A total of 3,631 patients were enrolled and divided into a training group (n=2,543) and a validation group (n=1,088) using stratified random sampling.
Five independent predictors were identified in the training group: hypertension combined with diabetes, antibiotic types, ventilator days, urinary catheter days, and PCT abnormality times. The logistic regression model achieved an AUC of 0.772 (95% CI: 0.733-0.812) in the validation group, outperforming XGBoost (0.763) and Random Forest (0.703). The model demonstrated excellent calibration (Hosmer-Lemeshow χ² = 1.94, P = 0.9829) and positive net clinical benefit across threshold probabilities of 0%-40%. SHAP analysis aligned with regression-derived variable importance rankings, confirming predictor contributions. An open-access online calculator was successfully deployed (https://dongfangshao666.shinyapps.io/MDR_shiny2/), enabling real-time individualized risk stratification at the bedside. Conclusion: This study developed and validated a dynamic, interpretable multi-drug-resistant bacterial infection risk prediction model requiring only five routinely collected clinical indicators. The model balances robust predictive performance with high transparency, overcoming key limitations of prior tools. The accompanying web calculator supports dynamic risk reassessment throughout the ICU stay, facilitating precise antimicrobial stewardship, targeted infection control interventions, and optimized resource allocation, bridging the gap between statistical modeling and frontline clinical decision-making.
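The stratified 7:3 split described in the Methods can be illustrated with a minimal sketch. The authors report using R; the Python below is a stand-in that only shows the idea of splitting within each outcome class so both groups keep the same event rate. It does not reproduce the study's actual partition, and the function name, labels, and class sizes are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_frac=0.7, seed=42):
    """Split indices into train/validation while preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, valid = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)                      # randomize within each class
        cut = round(train_frac * len(idxs))    # per-class training size
        train.extend(idxs[:cut])
        valid.extend(idxs[cut:])
    return sorted(train), sorted(valid)


# Hypothetical outcome labels: 300 infected (1) and 700 not infected (0).
labels = [1] * 300 + [0] * 700
train_idx, valid_idx = stratified_split(labels)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
valid_rate = sum(labels[i] for i in valid_idx) / len(valid_idx)
print(len(train_idx), len(valid_idx), train_rate, valid_rate)
```

Because the cut is made inside each class, the infection rate is the same in both groups, which is the property a stratified partition is meant to guarantee.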
Liu, Y.; Concepcion, D.
This research proposes an anomaly detection and assurance framework for AI-driven Nurse Call Systems (NCS) during operation. The study detects abnormal behaviors by simulating real call logs, injecting controllable anomalies, and applying a lightweight Isolation Forest model. The final visualization results are presented through an interactive dashboard. Our research focuses mainly on the medical environment, which is delay-sensitive and safety-critical. A distinctive feature of this research is that it can effectively enhance the reliability of system operation without relying on complex deep models or proprietary data, while maintaining safety and interpretability. The framework design emphasizes reproducibility and low computational overhead, so that the framework can be deployed rapidly on resource-constrained edge devices. Preliminary experimental results show that this method can maintain a reasonable precision rate. Additionally, the results indicate a high recall rate when detecting delay-type anomalies. Moreover, to reflect the system's performance in real scenarios, the framework detects delay metrics and hourly alarm quantity metrics, and reports Precision-Recall curves and their confidence intervals. Future work will consider introducing time, context features, and explainability analysis modules. The aim is to improve the model's accuracy and further meet the medical industry's requirements for auditability. This work focuses on the operational safety and reliability of AI-enabled Nurse Call Systems, addressing runtime failure modes that are underrepresented in current healthcare AI deployments. Rather than proposing new learning models, the contribution lies in a reproducible, interpretable assurance framework suitable for real clinical infrastructure.
To ensure transparency and reproducibility, all code, cleaned datasets, experiment scripts, and an interactive Streamlit demo (which allows users to upload their own CSVs) are publicly released as open research artifacts (Zenodo DOI: 10.5281/zenodo.17767143).
Enikeev, R.; Moldovan, M.; Chu, M.; Amalraj, A.; Koli, P. P.; Abdul, S. S.; Sivaraj, H.; Iqbal, U.; Toh, C. K.
Background: Structuring oncology clinical notes into registry-grade variables is essential for research and care but remains labour-intensive and error-prone. Objective: To develop and evaluate a privacy-preserving large language model pipeline for oncology registry abstraction in a real-world clinical setting. Methods: We deployed an open-source Meta Llama 3.3 70B-based pipeline to extract over 50 variables from 6,700 oncology notes at a cancer centre in Singapore. Data were de-identified locally using a Hide-In-Plain-Sight approach, ensuring no identifiable data left hospital infrastructure. Performance was assessed on 200 randomly sampled notes with adjudicated ground truth. A structure-aware framework classified outputs as correct, missing, spurious, or incorrect. Results: F1 scores were high across variables, including diagnosis (97.2%), histology (95.8%), stage (92.6%), biomarkers (91.4%), and treatments (88.1%). Transferability testing on 50 external notes showed strong performance for core variables. Conclusions: Privacy-preserving LLMs can achieve near-human-level accuracy for oncology abstraction, with structure-aware evaluation enabling more clinically meaningful assessment. Keywords: Oncology Registry Abstraction, Privacy-Preserving Deployment, Clinical Information Extraction, Structure-Aware Evaluation, Large Language Models, Template-Filling Metrics
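The structure-aware evaluation mentioned above classifies each extracted field as correct, missing, spurious, or incorrect. The sketch below shows one common way such counts feed precision, recall, and F1 in template-filling evaluation. It is an assumption about how the categories might be scored, not the authors' implementation, and the field names and values are hypothetical.

```python
def classify_slot(gold, pred):
    """Compare one gold value with one predicted value (None means absent)."""
    if gold is None and pred is None:
        return "empty"
    if gold is None:
        return "spurious"    # predicted a value where none exists
    if pred is None:
        return "missing"     # failed to extract an existing value
    return "correct" if gold == pred else "incorrect"


def template_f1(gold_record, pred_record):
    """Count slot outcomes across a record and derive F1 from them."""
    counts = {"correct": 0, "missing": 0, "spurious": 0, "incorrect": 0}
    for key in set(gold_record) | set(pred_record):
        outcome = classify_slot(gold_record.get(key), pred_record.get(key))
        if outcome != "empty":
            counts[outcome] += 1
    correct = counts["correct"]
    predicted = correct + counts["spurious"] + counts["incorrect"]
    actual = correct + counts["missing"] + counts["incorrect"]
    precision = correct / predicted if predicted else 0.0
    recall = correct / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return counts, f1


# Hypothetical gold annotation and model output for one note.
gold = {"diagnosis": "lung adenocarcinoma", "stage": "IIIA",
        "histology": "adenocarcinoma", "biomarker": None}
pred = {"diagnosis": "lung adenocarcinoma", "stage": "IIIB",
        "histology": None, "biomarker": "EGFR positive"}
counts, f1 = template_f1(gold, pred)
print(counts, round(f1, 3))
```

Separating missing from spurious and incorrect outputs makes it visible whether a low score comes from omissions or from wrong values, which plain exact-match accuracy would hide.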
Tharzeen, A.; Vafaei Sadr, A.; Radfar, N.; Hwang, W.; Abedi, V.; Zand, R.
Background: Machine learning models for stroke mortality prediction typically treat each time horizon independently and use flat tabular features that ignore the relational structure of electronic health records (EHRs). In this pilot study, we leveraged graph-based machine learning models to predict post-stroke all-cause mortality across three different time horizons. Methods: We developed Stroke Temporal Heterogeneous Graph (StrokeTHG), a heterogeneous graph neural network model for simultaneous multi-horizon stroke mortality prediction (30-day, 90-day, 1-year) using EHR data from Penn State Health System. The model encodes various relations among EHR entities (e.g., patient, diagnosis, comorbidity) and temporal encoding of admission time to better predict stroke mortality. We compared our proposed approach against various baseline methods, including Logistic Regression, Random Forest, and XGBoost. We also performed ablation and subgroup analyses, evaluated the quality of learned graph embeddings, and assessed the importance of different edge types in the graph. Results: We included 4,144 stroke patients (mean age 69.2 years; 54.3% men), of whom 3,332 (80.4%) were alive one year after their stroke. 30-day, 90-day, and 1-year mortality rates were 9.7%, 13.7%, and 19.6%, respectively. Our proposed approach, StrokeTHG, achieved AUROC of 0.872, 0.878, and 0.837 across horizons, outperforming all tabular baselines. At ≥75% specificity, the model identified 5-10 percentage points more mortality cases than the best baseline at each horizon. Subgroup analysis demonstrated consistent performance across sex subgroups and the largest discriminative gains in the Age 65-80 stratum. Edge-type ablation identified phenotype-patient and admission-patient edges in the constructed EHR graph as the most influential relational edges for mortality prediction.
StrokeTHG embeddings outperformed all graph and matrix factorization baselines under an identical downstream classifier, confirming that performance gains stem from representation quality rather than classifier capacity. Conclusions: StrokeTHG demonstrates that heterogeneous graph representations of EHR data provide a consistent improvement over flat tabular models for multi-horizon stroke mortality prediction, with particular advantage at clinically actionable sensitivity thresholds and novel multi-horizon monotonic prediction capability. This methodological framework may be adaptable to other EHR-based clinical research studies seeking to leverage heterogeneous relational structures for predictive modeling.
Li, L.; Sondh, S.; Sondh, H. K.; Stewart, R.; Roberts, A.
Background: Experiences of violence are reported frequently by mental health service users, victims of violence are at a greater risk of mental health disorders, and violence may sometimes occur as a consequence of a mental disorder. Electronic health records (EHRs) are an important source of information about healthcare and its social context. Occurrences of violence are not routinely recorded as structured data in EHRs, but they are recorded in the free-text narrative. Objective: Our objective was to address this research gap by creating a natural language processing (NLP) application that extracts information related to various forms of violence (physical (non-sexual), sexual, emotional, and financial) from the EHR of a large South London mental health service. Additionally, we aimed to extract features concerning the patient's role (victimization vs. perpetration), timing (recent vs. historic), domestic context, presence (actual, threat, or unclear), and polarity (affirmed, abstract, or negated) of the violent behaviors. Methods: After rigorous training with a pre-defined, approved coding book provided by senior professionals, two raters independently annotated 6,500 randomly selected segments of clinical notes containing violence-related keywords from a large mental healthcare provider in South London. Each segment contained 400 characters (approximately 200 characters before and after the keyword). We used 90% of the annotated data to fine-tune a multi-label BERT model (employing 5-fold cross-validation), with the remaining 10% reserved for a blind test. Results: The model performed well on the blind test set for emotional violence (F1 = 0.89), financial violence (0.88), physical (non-sexual) violence (0.84), and unspecified violence (0.81), and for the patient role (0.89 as perpetrator; 0.84 as victim), polarity (0.89 for affirmed behavior), presence (0.95 for actual violence), and domestic settings (0.88).
We were unable to achieve satisfactory results in capturing temporal aspects (0.65 for past violence). Conclusions: We were able to improve substantially on previously developed NLP for ascertaining violence in routine mental health records, providing novel opportunities for both surveillance and research.
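The segment construction described in the Methods (a keyword with roughly 200 characters of context on each side) could be sketched as follows. The function name `keyword_segments`, the example keywords, and the choice to measure the window from the edges of the matched keyword are assumptions; the abstract does not specify these details.

```python
import re


def keyword_segments(note, keywords, window=200):
    """Yield (keyword, segment) pairs, where each segment holds the matched
    keyword plus up to `window` characters of context on either side."""
    if not keywords:
        return
    # Case-insensitive alternation over the (hypothetical) keyword list.
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    for match in pattern.finditer(note):
        start = max(0, match.start() - window)          # clip at start of note
        end = min(len(note), match.end() + window)      # clip at end of note
        yield match.group(0).lower(), note[start:end]


# Hypothetical note text and keyword list, for illustration only.
note = "x" * 300 + " reported an assault by a neighbour " + "y" * 300
for keyword, segment in keyword_segments(note, ["assault", "threat"]):
    print(keyword, len(segment))
```

Segments near the start or end of a note come out shorter than the nominal length, which is one plausible reading of "approximately 200 characters" in the abstract.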
Jiang, Y.; Ma, S.; Liang, A.; Kim, G.; Acharya, A.; Mony, S.; Punnathanam, S.; Makeown, J.; Jose, J.; Shieh, L.; Pham, T.; Ng, A. Y.; Chen, J. H.
This study explores integrating machine learning into electronic medical record systems to predict stability of inpatient lab tests. A smart alerts system was developed and tested at Stanford Hospital. The system identifies stable lab results, advising clinicians on test ordering. Live deployment showed desired precision at good recall in predicting test result stability, with suggestions for system optimization identified. This approach may significantly decrease low-yield testing and enhance personalized clinical decision-making.
Kizilaslan, B.; Mehlum, L.
Purpose: Suicide and self-harm are major public health concerns characterized by substantial clinical and psychosocial heterogeneity. While latent class analysis has been used to identify subgroups of people with suicidal behavior, the extent to which such population-level phenotyping complements explainable artificial intelligence-based classification models remains unclear. Methods: We applied latent class analysis to a cross-sectional, publicly available dataset of 1,000 individuals presenting with self-harm and suicide-related behaviors at Colombo South Teaching Hospital, Kalubowila, Sri Lanka. Sociodemographic, psychosocial, and clinical variables were used to identify latent subgroups. Class characteristics and suicide prevalence were examined and compared with variable importance patterns reported in a previously published explainable artificial intelligence (XAI)-based suicide classification study using the same dataset. Results: Four latent classes were identified. Two classes exhibited very high suicide prevalence (91.2% [95% CI: 87.7-93.8] and 99.0% [95% CI: 96.4-99.7]), whereas two classes showed low prevalence (<1%). The two high-prevalence classes differed markedly in lifetime psychiatric hospitalization history, with one class showing a 100% prevalence of prior hospitalization and the other substantially lower hospitalization rates. These patterns partially aligned with, and extended beyond, variable importance findings from the XAI-based model. Conclusion: Latent class analysis identified distinct subgroups with substantially different suicide prevalence and clinical profiles, underscoring the heterogeneity of individuals presenting with self-harm. Comparison with the findings of the XAI-based suicide classification model suggests that unsupervised phenotyping and supervised classification provide complementary perspectives, offering population-level context that may enhance the interpretability of suicide assessment frameworks.
Keywords: suicide; self-harm; latent class analysis; explainable artificial intelligence; machine learning
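The class-level prevalences above are reported with asymmetric 95% confidence intervals. The abstract does not say how these were computed; the sketch below uses the Wilson score interval, one common choice for proportions near 0 or 1, purely as an assumption. The function name and the counts in the usage example are hypothetical and are not taken from the study.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% interval.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical counts: 50 events among 100 class members.
low, high = wilson_interval(50, 100)
print(f"{low:.3f} to {high:.3f}")
```

Unlike the simple normal-approximation interval, the Wilson interval stays inside [0, 1] and is not symmetric around the observed proportion when that proportion is extreme.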
Osborne, T.; Mahmud, T.; Zheng, X.; Jampala, S.; Abbasi, S.; Hong, S.; Kranz, K.; Lee, S.; Ng, P.; Odekon, K.; Schachter, L.; Sexton, R.; Spinnato, T.; Tharakan, M.; Wu, Z.; Wang, F.; Wong, R.
Although large language models (LLMs) have shown promise for discharge summary generation, their value may be greater in longer hospitalizations, where increasing documentation volume and complexity increase both clinician burden and the risk of communication failures during transitions of care. Prior evaluations of LLM-generated discharge summaries have largely involved shorter stays and have rarely examined receiving-clinician priorities or incidental finding reporting. We compared LLM-generated and human-authored discharge summaries for 60 Internal Medicine hospitalizations lasting 7 to 21 days, with paired assessment by hospitalists and primary care physicians (PCPs). Clinician reviewers preferred LLM-generated summaries for 95% of encounters and rated them higher for quality, readability, factuality and completeness. PCPs, the primary recipients responsible for post-discharge care, found that LLM-generated summaries were better for understanding and communicating hospital care to patients, and providing follow-up care. LLM-generated summaries had fewer annotated errors, primarily due to fewer omissions, without increased estimated harm potential or likelihood compared with human-authored summaries. Benefits of LLM-generated summaries were especially salient for PCPs, who identified more omissions with greater downstream likelihood of harm than hospitalists. This underscores the importance of designing transition documents around the needs of clinicians assuming care post-discharge. LLM identification of radiology incidental findings was generally accurate and appropriate, suggesting potential to improve follow-up of clinically relevant findings. These findings extend prior work by demonstrating clinical value of LLMs in summarizing longer, complex hospitalizations and highlighting the value of stakeholder-centered design in clinical AI systems. 
Together, they support supervised LLM-assisted discharge summarization as a tool to reduce cognitive burden, improve documentation quality, and enhance transition-of-care communication.
van Wijk, R. J.; Schoonhoven, A. D.; de Vree, L.; Ter Horst, S.; Gaidhane, C.; Alcaraz, J. M. L.; Strodthoff, N.; ter Maaten, J. C.; Bouma, H. R.; Li, J.
Purpose: Early recognition of deterioration in patients with suspected infection at the emergency department (ED) is important. Current clinical scoring systems show limited discriminative performance for early deterioration. Continuous electrocardiogram (ECG) recordings may offer additional dynamic physiological information that can enhance early prediction of deterioration in patients with suspected infection. Methods: We developed a multimodal, ECG-derived spectrogram-based pipeline to predict deterioration within 48 hours of ED admission. We used the first 20 minutes of ECG recordings for the spectrograms. We compared the model with the National Early Warning Score (NEWS), quick Sequential Organ Failure Assessment (qSOFA), a baseline model with vital parameters, sex, and age, and a Heart Rate Variability (HRV)-derived model. Results: In this study, 1,321 patients were included, of whom 159 (12%) deteriorated. The multimodal model combining baseline data with spectrograms showed the best overall performance, with an area under the receiver operating characteristic curve (AUROC) of 0.788, followed by the baseline model (age, sex, triage vitals) alone, with an AUROC of 0.730. The HRV-only model and the qSOFA showed the lowest performance (AUROC 0.585 and 0.693, respectively). Conclusion: This study shows that ECG-derived multimodal spectrogram models outperform those based solely on vital signs and HRV features, as well as established clinical scores such as NEWS and qSOFA. Spectrogram analysis represents a promising approach to enhance early risk stratification and support clinical decision-making for patients with suspected infection in the ED.
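The models and scores above are compared by AUROC. As a reminder of what that number measures, here is a minimal sketch that computes it as the probability that a randomly chosen deteriorating patient receives a higher score than a randomly chosen non-deteriorating one, with ties counted as one half. The function name and toy data are assumptions, and the quadratic pairwise loop is chosen for clarity, not speed.

```python
def auroc(y_true, y_score):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data (hypothetical, not from the study).
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

Because it depends only on the ranking of scores, this measure lets an integer score such as NEWS or qSOFA be compared directly with a model's predicted probabilities.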